Diachronic Changes in Text Complexity in 20th Century English Language: An NLP Approach
Abstract
Syntactically complex text can pose problems both for human comprehension and for various NLP tasks. Many studies in text simplification address this problem, aiming to transform a given text into a simpler form that is accessible to a wider audience. In this study, we investigate the natural tendency of texts in 20th century English: are they becoming syntactically more complex over the years, requiring a higher literacy level and greater effort from readers, or are they becoming simpler and easier to read? We examine several factors of text complexity (average sentence length, Automated Readability Index, sentence complexity and passive voice) in the two main varieties of English, British and American, using the ‘Brown family’ of corpora. For British English, we compare the complexity of texts published in 1931, 1961 and 1991; for American English, we compare texts published in 1961 and 1992. Furthermore, we demonstrate how state-of-the-art NLP tools can be used to automatically extract some complex features from the raw text version of the corpora.
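As an illustration of the two simplest measures named in the abstract, the following minimal sketch computes average sentence length and the Automated Readability Index using the standard ARI formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The regex-based sentence splitter and tokenizer, and all function names, are assumptions made for this example; they are not the NLP tools used in the study, and a real corpus would need a proper sentence segmenter.

```python
import re


def split_sentences(text):
    # Naive splitter (an assumption for this sketch): break after ., ! or ?
    # when followed by whitespace. Abbreviations will be split incorrectly.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Words are runs of letters or digits, optionally joined by an
    # internal apostrophe or hyphen (e.g. "don't", "state-of-the-art").
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", sentence)


def text_counts(text):
    # Return (characters, words, sentences); characters are counted as
    # letters and digits only.
    sentences = split_sentences(text)
    words = [w for s in sentences for w in tokenize(s)]
    chars = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    if not sentences or not words:
        raise ValueError("text contains no sentences or no words")
    return chars, len(words), len(sentences)


def average_sentence_length(text):
    # Mean number of words per sentence.
    _, n_words, n_sentences = text_counts(text)
    return n_words / n_sentences


def automated_readability_index(text):
    # Standard ARI formula applied to the counts above.
    n_chars, n_words, n_sentences = text_counts(text)
    return 4.71 * n_chars / n_words + 0.5 * n_words / n_sentences - 21.43


if __name__ == "__main__":
    sample = "The cat sat. The dog ran away quickly."
    print(average_sentence_length(sample))
    print(automated_readability_index(sample))
```

Applied to each text sample in a corpus and averaged per year, such scores give the kind of per-period values that can then be compared across 1931, 1961 and 1991/1992.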
Related papers
Diachronic Stylistic Changes in British and American Varieties of 20th Century Written English Language
In this paper we present the results of a study investigating the diachronic changes of four stylistic features: average sentence length, Automated Readability Index, lexical density and lexical richness in 20th century written English language. All experiments were conducted on the largest existing diachronic corpora of British and American English – the Brown ‘family’ corpora, employing NLP t...
Style of Religious Texts in 20th Century
In this study, we present the results of the investigation of diachronic stylistic changes in 20th century religious texts in two major English language varieties – British and American. We examined a total of 146 stylistic features, divided into three main feature sets: (average sentence length, Automated readability index, lexical density and lexical richness), part-of-speech frequencies and ...
Towards a Better Exploitation of the Brown 'Family' Corpora in Diachronic Studies of British and American English Language Varieties
Since the 1990s, the Brown ‘family’ corpora have been widely used for various diachronic studies of 20th century English language. However, the existing methodologies failed to exploit their full potential as they only used the four main text categories. In this paper, we present the results of two experiments on diachronic changes of the Coleman-Liau readability Index (CLI) in British and Americ...
Using Comparable Corpora to Track Diachronic and Synchronic Changes in Lexical Density and Lexical Richness
This study from the area of language variation and change is based on exploitation of the comparable diachronic and synchronic corpora of 20th century British and American English language (the ‘Brown family’ of corpora). We investigate recent changes of lexical density and lexical richness in two consecutive thirty-year time gaps in British English (1931–1961 and 1961–1991) and in 1961–1992 in...
A fully data-driven method to identify (correlated) changes in diachronic corpora
In this paper, a method for measuring synchronic corpus (dis-)similarity put forward by Kilgarriff (2001) is adapted and extended to identify trends and correlated changes in diachronic text data, using the Corpus of Historical American English (Davies 2010a) and the Google Ngram Corpora (Michel et al. 2010a). This paper shows that this fully data-driven method, which extracts word types that h...